I want this to serve as a basic resource for this course.
The materials for this workshop can be accessed on GitHub.
Some of these materials were adapted from Data Carpentry, Ethan White, Simon Peng, and Hadley Wickham (and probably many others that I forgot).
The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.
RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
R does not involve lots of pointing and clicking, and that’s a good thing! The learning curve might be steeper than with other software, but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands. So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results; you just have to run your script again. Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes. Working with scripts also forces you to have a deeper understanding of what you are doing, which facilitates your learning and comprehension of the methods you use.
Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis. R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically.
An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.
With 10,000+ packages that can be installed to extend its capabilities, R allows you to combine statistical approaches from many scientific disciplines to best suit the analysis your data require. R has packages for image analysis, GIS, time series, population genetics, and much more.
The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you. R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient. R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.
The plotting functionalities in R are endless, and allow you to adjust any aspect of your graph to convey most effectively the message from your data.
Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, or on the RStudio community.
Not only is R free, but it is also open-source and cross-platform. Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.
R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment (IDE) that makes using R much easier and more interactive. You need to install R before you install RStudio.
In my opinion, RStudio is the easiest and most efficient way to use R. It’s well worth it to take the time to get comfortable with it.
RStudio has four panes: the Source editor (top left), the Console (bottom left), the Environment/History pane (top right), and the Files/Plots/Packages/Help pane (bottom right).
R works just like a calculator
Type 10 + 2 into the console and press Enter
10 + 2
## [1] 12
We can do the same with other operators (note: use * for multiplication, not x).
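For example, multiplication, division, and exponentiation all work the same way:

```r
10 * 2   # multiplication
## [1] 20
10 / 2   # division
## [1] 5
10^2     # exponentiation
## [1] 100
```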
We are not limited to two numbers … we can use as many as we like.
Type 8 + 2 - 10 / 43 and press Enter
8 + 2 - 10 / 43
## [1] 9.767442
These operations follow the standard order of operations (multiplication and division first, followed by addition and subtraction).
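When you need a different order, parentheses override the default:

```r
2 + 3 * 4     # multiplication happens first
## [1] 14
(2 + 3) * 4   # parentheses force the addition first
## [1] 20
```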
R is a programming language, and one reason we use a programming language rather than a calculator is to automate a process or avoid unnecessary repetition.
In this case, we may want to use our result from above in a second calculation. Instead of retyping 10 + 2 every time we need it, we can just create a new variable that stores the result.
The way you assign a value to a variable in R is by using the assignment operator, which is just a ‘less than’ symbol followed by a ‘minus’ sign. It looks like this: <-
Think of the assignment operator as an arrow. You are assigning the value on the right side of the arrow to the variable name on the left side of the arrow.
To assign the result of 10 + 2 to a new variable called x, you type x <- 10 + 2. This can be read as ‘x gets 10 plus 2’. Give it a try now. What happened?
x <- 10 + 2
You’ll notice that R did not print the result of 12 this time. When you use the assignment operator, R assumes that you don’t want to see the result immediately, but rather that you intend to use the result for something else later on. R stores the variable x and its contents in the memory for this R session.
To view the contents of the variable x, just type x and press Enter. Try it now.
x
## [1] 12
This ability to store variables, or objects, and their contents is very useful. It allows us to pass these variables to other calculations, as well as store the output as a different variable.
Now, store the result of x - 3 in a new variable called y.
y <- x - 3
What is the value of y? Type y to find out.
y
## [1] 9
A package is a fundamental unit of reproducible R code. Packages bundle reusable functions, the documentation that describes how to use them, and typically some sample data. In short, packages are sets of additional functions that let you do more stuff. Functions like str() or data.frame() come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you need to load it in every subsequent R session in which you use it.
You can install a package in RStudio in two ways: with the install.packages() function in the console, or through the Packages tab (click Install).

Today we are going to use a package called the “tidyverse”, which is actually a collection of packages designed to work together. It is widely regarded as a leading toolkit for data wrangling. We’ll barely scratch the surface, but I think it’s important that you learn some of these capabilities. Then, if needed, you’ll be able to dig in and explore more in your own work.
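For example, to install the tidyverse from the console (you only need to do this once per machine):

```r
# Downloads and installs the package from CRAN; run once, not in every script
install.packages("tidyverse")
```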
Installing a package does not make its functions, objects, and help files available; you must first load it with the library() function at the beginning of your code file:
library(tidyverse) #loads the tidyverse; the startup message lists the attached packages
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Today we will work with the famous iris data set (collected by Edgar Anderson in 1935). This is a built-in data set available in R. These built in data sets are very helpful for examples. However, I want all of you to understand how to load data into R, so please download the data here.
Best Practice - have your working directory set automatically to a common place (e.g., with an RStudio Project) and use relative paths
Once your project is set up, reload the tidyverse package:
library(tidyverse) #loads the tidyverse; the startup message lists the attached packages
then load the data:
iris.df <- read_csv("SourceData/iris.csv")
## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Species
## dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This creates a dataframe called iris.df. A dataframe is used for storing data tables. It is a list of vectors of equal length. We can create a dataframe from scratch by combining multiple vectors of equal length. However, it is much more common to work with dataframes that are input from an existing table, which we did here.
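As an aside, here is a minimal sketch of building a dataframe from scratch; the vectors and their values are made up for illustration:

```r
site <- c("A", "B", "C")           # a character vector
count <- c(12, 7, 30)              # a numeric vector of the same length
toy.df <- data.frame(site, count)  # combine the vectors into a dataframe
toy.df
##   site count
## 1    A    12
## 2    B     7
## 3    C    30
```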
View the dataframe:
view(iris.df)
However, dataframes are often very large, so let’s get familiar with programmatic functions for viewing a particular dataframe.
View the first few rows of the dataframe:
head(iris.df)
or more details about the structure of the dataframe:
str(iris.df)
head() and str() are built in functions. To get help about a function simply type:
?str #the bottom right pane in RStudio will display the help
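A few other built-in functions are handy for quick inspection; for example:

```r
dim(iris.df)      # number of rows and columns
names(iris.df)    # column names
summary(iris.df)  # summary statistics for each column
```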
Suppose we want to create a new identifier named “IDENT” of the form “A-#” (e.g. “A-1”) using the paste() and nrow() functions along with the $ operator:
iris.df$IDENT <- paste("A", 1:nrow(iris.df), sep = "-")
Have a look at the new attribute:
head(iris.df)
Once you become an R pro, you’ll never want to export your data, but in case you need to:
write_csv(iris.df, "DerivedData/iris_out.csv")
Notice, we exported the data to the “DerivedData” directory. Have a look at the data in Windows Explorer or Finder.
R is an incredibly powerful tool for working with data - e.g. reformatting, subsetting, and massaging data, collectively referred to as data wrangling - to get it into the format you need for analysis. This is an important and often time-consuming step.
Visualizing your data in a number of different ways allows you to determine the distribution or shape of the data which is helpful as you move into your statistical analysis. This exercise also forces you to find mistakes in the data or spurious values.
Base R has a number of no-frills plotting capabilities:
#Plot a histogram
hist(iris.df$Petal.Length, main = "Histogram of Petal Length")
#Plot a boxplot grouped by species
boxplot(Petal.Length ~ Species, data = iris.df, main = "Boxplot by Species", xlab = "Species")
We can also explore the relationship between petal length and width using a simple scatter plot.
#Plot a scatterplot
plot(iris.df$Petal.Width, iris.df$Petal.Length, main = "Scatterplot")
ggplot2 is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatter plot. This helps in creating publication-quality plots with minimal amounts of adjustment and tweaking.
ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots. The ggplot2 package is part of the tidyverse, so it was already loaded with library(tidyverse).
Let’s visualize the previous scatterplot in ggplot2:
ggplot(iris.df, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(shape = 19, color = "blue", alpha = 0.4) + # alpha fades the points
  geom_smooth(method = "lm", se = FALSE, color = "black") # fit a line
## `geom_smooth()` using formula 'y ~ x'
At this point, you can conduct any number of data visualization or data analysis techniques.
Sticking with the prior example, it seems like there is a pretty strong relationship between iris petal length and petal width. Let’s fit a model to find out!
model1 <- lm(Petal.Length ~ Petal.Width, data = iris.df)
Congratulations - you just ran a regression in R using one line of code!
Next, we’ll examine the output of the model. There are a number of ways to do this, but the summary command will give you everything:
summary(model1)
##
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33542 -0.30347 -0.02955 0.25776 1.39453
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08356 0.07297 14.85 <2e-16 ***
## Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
## F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
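Beyond summary(), a few helper functions pull specific pieces out of the fitted model; for example (the petal widths passed to predict() are hypothetical values chosen for illustration):

```r
coef(model1)     # the intercept and slope
confint(model1)  # 95% confidence intervals for the coefficients
# predict petal length at new (hypothetical) petal widths
predict(model1, newdata = data.frame(Petal.Width = c(0.5, 1.5)))
```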
R for Data Science by Hadley Wickham and Garrett Grolemund - this is an open source text.
When in doubt, use the google!
Some of the material for this lesson was adapted by Tim Assal from Data Carpentry’s Introduction to Geospatial Concepts. Thanks to Leah Wasser; Megan A. Jones; Lauren O’Brien; Joseph Stachelek; Tom Wright; Tracy Teal; Dev Paudel; Jane Wyngaard; Anne Fouilloux; Bidhyananda Yadav; Chris Prener; Tyson Swetnam; Erin Becker; Katrin Leinweber
Spatial data is information about the locations and shapes of geographic features and the relationships between them, usually stored as coordinates and topology.
The two primary types of geospatial data are raster and vector data. Raster data is stored as a grid of values which are rendered on a map as pixels. Each pixel value represents an area on the Earth’s surface. Vector data structures represent specific features on the Earth’s surface, and assign attributes to those features.
Vector objects are either points, lines, or polygons.
Image Source: National Ecological Observatory Network (NEON)
Points: Each point is defined by a single x, y coordinate. There can be many points in a vector point file. Examples of point data include: sampling locations, the location of individual trees, or the location of survey plots.
Lines: A line is composed of two or more connected points. For instance, a road or a stream may be represented by a line composed of a series of segments; each “bend” in the road or stream is a vertex with a defined x, y location.
Polygons: A polygon consists of 3 or more vertices that are connected and closed. The outlines of survey plot boundaries, lakes, oceans, and states or countries are often represented by polygons.
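To make the three vector types concrete, here is a toy sketch using base R graphics; all coordinates are made up for illustration:

```r
# An empty plot to draw on
plot(NA, xlim = c(0, 10), ylim = c(0, 10), xlab = "x", ylab = "y")
points(c(2, 3), c(8, 9), pch = 19)        # two points
lines(c(1, 3, 5, 7), c(1, 2, 1.5, 3))     # a line with four vertices
polygon(c(6, 9, 9, 6), c(6, 6, 9, 9))     # a closed polygon
```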
Sometimes boundary layers, such as states and countries, are stored as lines rather than polygons. When represented as lines, however, these boundaries do not form a closed object with a defined area that can be filled.
The spatial extent is the geographic area that the vector data covers. The spatial extent of an R spatial object represents the geographic edge or location that is the furthest north, south, east, and west. In other words, extent represents the overall geographic coverage of the spatial object.
Raster data is any pixelated (or gridded) data where each pixel is associated with a specific geographical location. The value of a pixel can be continuous (e.g. elevation) or categorical (e.g. land use). If this sounds familiar, it is because this data structure is very common: it’s how we represent any digital image. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster’s extent and cell size, the number of rows and columns, and its coordinate reference system (or CRS).
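Conceptually, a raster is just a matrix of values with spatial metadata attached. Here is a minimal sketch with toy values (no real spatial information, just the grid idea):

```r
# A 3 x 3 grid of toy "elevation" values
elev <- matrix(c(300, 310, 320,
                 305, 315, 325,
                 310, 320, 330), nrow = 3, byrow = TRUE)
image(elev, main = "Toy elevation grid")  # render the grid as colored pixels
```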
Source: National Ecological Observatory Network (NEON)
Some examples of continuous rasters include precipitation maps, elevation maps, and maps of tree height derived from LiDAR data.
A map of elevation for Harvard Forest, derived from the NEON AOP LiDAR sensor, is below. Elevation is represented as a continuous numeric variable in this map. The legend shows the continuous range of values in the data, from around 300 to 420 meters.
Some rasters contain categorical data where each pixel represents a discrete class such as a landcover type (e.g., “forest” or “grassland”) rather than a continuous value such as elevation or temperature. Some examples of classified maps include landcover maps and maps of elevation or tree height binned into classes (e.g., low, medium, high).